1. Disaster-planning strategies
Ask three different people what their idea of a disaster is and
you’ll probably get three different answers. For most administrators,
the term “disaster” probably means any scenario in which one or more
essential system services cannot operate and the prospects for quick
recovery are less than hopeful—that is, a disaster is something a
service reset or system reboot won’t fix.
To ensure that operations can be restored as quickly as possible in a
given situation, every network needs a clear disaster recovery plan.
Disaster planning draws on many of the same concepts as planning for
highly available, scalable, and manageable systems. Why? Because, at the
end of the day, disaster planning involves implementing plans that ensure
the availability of systems and services. Remember that part of disaster
planning is applying some level of contingency planning to every
essential network service and system. You need to implement problem
escalation and response procedures. You also need a standing
problem-resolution document that describes in great detail what to do
when disaster strikes.
Developing contingency procedures
You should identify the services and systems that are essential to
network operations. Typically, this list will include the following
components:
- Network infrastructure servers running Active Directory, Domain Name
  System (DNS), Dynamic Host Configuration Protocol (DHCP), Remote Desktop
  Services, and Routing and Remote Access Service (RRAS)
- File, database, and application servers, such as servers with
  essential file shares or those that provide database or email services
- Networking hardware, including switches, routers, and firewalls
Combine your availability, scalability, and manageability plans with plans for contingency procedures in the following areas:
- Physical security
  Place network hardware and servers in a locked, secure access facility.
  This could be an office that is kept locked or a server room that
  requires a passkey to enter. When physical access to network hardware
  and servers requires special access privileges, you prevent many
  problems and ensure that only authorized personnel can get access to
  systems from the console.
- Data backup
  Implement a regular backup plan that ensures that multiple datasets are
  available for all essential systems, and that these backups are stored
  in more than one location. For example, if you keep the most current
  backup sets on-site in the server room, you should rotate another backup
  set to off-site storage. In this way, if disaster strikes, you are more
  likely to be able to recover operations.
- Fault tolerance
  Build redundancy into the network and system architecture. At the
  server level, you can protect data using a redundant array of
  independent disks (RAID) and guard against component failure by having
  spare parts on hand. These precautions protect servers at a very basic
  level.
- Recovery
  Every essential server and network device should have a written
  recovery plan that details step by step what to do to rebuild and
  recover it. Be as detailed and explicit as possible, and don’t assume
  that the readers know anything about the system or device they are
  recovering. Do this even if you are sure that you’ll be the one
  performing the recovery; you’ll be thankful for it, trust me. Things can
  and do go wrong at the worst times, and sometimes, under pressure, you
  might forget some important detail in the recovery process, not to
  mention that you might be unavailable to recover the system for some
  reason.
- Power protection
  Power-protect servers and network hardware using an uninterruptible
  power supply (UPS) system. Power protection will help safeguard servers
  and network hardware from power surges and dirty power. Power protection
  will also help prevent data loss and allow you to power down servers in
  an appropriate fashion through manual or automatic shutdown.
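The data backup guideline above can be spot-checked mechanically. The following is a minimal Python sketch, not part of any backup product: it assumes a hypothetical backup catalog of (system, location) pairs, and the server names, location labels, and function name are all illustrative. It lists the essential systems whose backup sets are not yet held in more than one location.

```python
from collections import defaultdict

# Hypothetical backup catalog: (system name, storage location) pairs.
# The names and locations are illustrative only.
BACKUP_SETS = [
    ("dc01", "onsite"),
    ("dc01", "offsite"),
    ("sql01", "onsite"),
    ("file01", "onsite"),
    ("file01", "offsite"),
]

# Hypothetical list of systems identified as essential during planning.
ESSENTIAL_SYSTEMS = ["dc01", "sql01", "file01", "rras01"]


def systems_with_single_location_backups(essential, backup_sets):
    """Return essential systems whose backups are held in fewer than two locations."""
    locations = defaultdict(set)
    for system, location in backup_sets:
        locations[system].add(location)
    # A system passes only if its backup sets exist in at least two
    # distinct locations; a system with no backups at all also fails.
    return [name for name in essential if len(locations[name]) < 2]


if __name__ == "__main__":
    print(systems_with_single_location_backups(ESSENTIAL_SYSTEMS, BACKUP_SETS))
```

With the sample data, the check flags sql01 (one on-site copy only) and rras01 (no backups recorded at all).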
Implementing problem-escalation and response procedures
As part of planning, you need to develop well-defined problem-escalation
procedures that document how to handle problems and emergency changes
that might be needed. You need to designate an incident response team and an emergency response team. Although the two teams could consist of the same team members, the teams differ in fundamental ways:
- Incident response team
  The incident response team’s role is to respond to security incidents,
  such as the suspected cracking of a database server. This team is
  concerned with responding to an intrusion, taking immediate action to
  safeguard the organization’s information, documenting the security issue
  thoroughly in an after-action report, and then fixing the security
  problem so that the same type of incident cannot recur. Your
  organization’s security administrator or network security expert should
  have a key role in this team.
- Emergency response team
  The emergency response team’s role is to respond to service and system
  outages, such as the failure of a database server. This team is
  concerned with recovering the service or system as quickly as possible
  and allowing normal operations to resume. Like the incident response
  team, the emergency response team needs to document the outage
  thoroughly in an after-action report, and then, if applicable, propose
  changes to improve the recovery process. Your organization’s system
  administrators should have key roles in this team.
Creating a problem-resolution policy document
Over the years, I’ve worked with and consulted for many
organizations, and I’ve often been asked to help implement information
technology (IT) policies and procedures. In the area of disaster
and recovery planning, there’s one policy document that I always use,
regardless of the size of the company I am working with. I call it the problem-resolution policy document.
The problem-resolution policy document has the following six sections:
- Responsibilities
  The overall responsibilities of IT and engineering staff during and
  after normal business hours should be detailed in this section. For an
  organization with 24/7 operations, such as a company with a public World
  Wide Web site maintained by internal staff, the after-hours
  responsibilities section should be very detailed and let individuals
  know exactly what their responsibilities are. Most organizations with
  24/7 operations will designate individuals as being “on call” 7 days a
  week, 365 days a year, and in that case, this section should detail what
  being “on call” means and what the general responsibilities are for an
  individual on call.
- Phone roster
  Every system and service you identify in your planning as essential
  should have a point of contact. For some systems, you’ll have several
  points of contact. Consider, for example, a database server. You might
  have a system administrator who is responsible for the server itself, a
  database administrator who is responsible for the database running on
  the server, and an integration specialist responsible for any
  integration components running on the server.
Important
The phone roster should include both on-site and off-site contact
numbers. Ideally, this means that you’ll have the work phone number,
cell phone number, and pager number of each contact. It should be the
responsibility of every individual on the phone roster to ensure that
contact information is up to date.
- Key contact information
  In addition to a phone roster, you should have contact numbers for
  facilities and vendors. The key contacts list should include the main
  office phone numbers at branch offices and data centers and contact
  numbers for the various vendors that installed infrastructure at each
  office, such as the building manager, Internet service provider (ISP),
  electrician, and network wiring specialist. It should also include the
  support phone numbers for hardware and software vendors and the
  information you’ll be required to give in order to get service, such as
  customer identification number and service contract information.
- Notification procedures
  Problems get resolved only when the right people are notified. This
  section should outline the notification procedures and the primary point
  of contact in case of outage. If many systems and services are involved,
  notification and primary contacts can be divided into categories. For
  example, you might have an external systems-notification process for
  your public Internet servers and an internal systems-notification
  process for your intranet services.
- Escalation
  When problems aren’t resolved within a specific timeframe, there should
  be clear escalation procedures that detail whom to contact and when. For
  example, you might have level 1, level 2, and level 3 points of contact,
  with level 1 contacts being called immediately, level 2 contacts being
  called when issues aren’t resolved in 30 minutes, and level 3 contacts
  being called when issues aren’t resolved in 60 minutes.
Important
You should also have a priority system in place that dictates what types
of incidents or outages take precedence over others. For example, you
could specify that service-level outages, such as those that involve the
complete system, have priority over an isolated outage involving a
single server or application, but that suspected security incidents have
priority over all other issues.
- Post-action reporting
  Every individual involved in a major outage or incident should be
  expected to write a post-action report. This section details what should
  be in that report. For example, you would want to track the notification
  time, actions taken after notification, escalation attempts, and other
  items that are important to improving the process or preventing the
  problem from recurring.
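The escalation timing and priority ordering described in this list can be written down as a small lookup. This is only a sketch in Python under stated assumptions: the 30-minute and 60-minute thresholds come from the example in the text, while the category names and function names are hypothetical and would be replaced by whatever your own policy document defines.

```python
# Escalation thresholds from the example in the text: level 1 is called
# immediately, level 2 after 30 unresolved minutes, level 3 after 60.
ESCALATION_THRESHOLDS = [(1, 0), (2, 30), (3, 60)]

# Hypothetical category names; a lower number means a higher priority.
# Suspected security incidents outrank service-level outages, which
# outrank isolated outages, as in the example in the text.
PRIORITY = {"security_incident": 0, "service_outage": 1, "isolated_outage": 2}


def levels_to_contact(minutes_unresolved):
    """Return the contact levels that should have been called by now."""
    return [level for level, after in ESCALATION_THRESHOLDS
            if minutes_unresolved >= after]


def order_by_priority(open_issues):
    """Sort (description, category) pairs so higher-priority issues come first."""
    return sorted(open_issues, key=lambda issue: PRIORITY[issue[1]])


if __name__ == "__main__":
    print(levels_to_contact(45))
    print(order_by_priority([
        ("mail server down", "isolated_outage"),
        ("suspected database intrusion", "security_incident"),
        ("public site unreachable", "service_outage"),
    ]))
```

Keeping the thresholds and priorities in plain data structures means the numbers can be changed to match the written policy without touching the logic.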
Every IT group should have a general policy with regard to problem-resolution procedures, and this policy should be detailed in a problem-resolution
policy document or one like it. The document should be distributed to
all relevant personnel throughout the organization so that every person
who has some level of responsibility for ensuring system and service
availability knows what to do in the case of an emergency. After you
implement the policy, you should test it to help refine it so that the
policy will work as expected in an actual disaster.
2. Disaster preparedness procedures
Just as you need to perform planning before disaster
strikes, you also need to perform certain predisaster preparation
procedures. These procedures, which include performing regular backups,
knowing how to repair startup problems, and setting startup and recovery
options, ensure that you are able to recover systems as quickly as
possible when a disaster strikes.
You should perform regular backups of every server. Backups can be
performed using several techniques. Most organizations choose a
combination of dedicated backup servers and per-server
backups. If you use professional backup software, you can use one or
more dedicated backup servers to create backups of other servers on the
network, and then write the backups to media on centralized backup
devices. If you use per-server backups, you run backup software on each
server that you want to back up and store the backup media on a local
backup device. By combining the techniques, you get the best of both
worlds.
With dedicated backup servers, you purchase professional backup
software, a backup server, and a scalable backup device. The initial
costs for purchasing the required equipment and the time required to set
up the backup environment can be substantial. However, after the backup
environment is configured, it is rather easy to maintain. Centralized backups also offer substantial time savings for administrators because the backup process itself can be fully automated.
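Per-server backups can be automated as well. As one possible sketch, the following Python snippet assumes that the server has the Windows Server Backup command-line tool (wbadmin) installed, and it builds a one-time backup command for that tool. The drive letters and the function names are placeholders chosen for illustration; a scheduled task or professional backup software would serve the same purpose.

```python
import subprocess


def build_backup_command(target, volumes, all_critical=True):
    """Build a wbadmin command line for a one-time backup.

    target  -- backup destination, such as a drive letter ("E:")
    volumes -- volumes to include, such as ["C:", "D:"]
    """
    command = [
        "wbadmin", "start", "backup",
        f"-backupTarget:{target}",
        f"-include:{','.join(volumes)}",
    ]
    if all_critical:
        # Also include every volume needed to recover the operating system.
        command.append("-allCritical")
    # -quiet suppresses the confirmation prompt so the job can run unattended.
    command.append("-quiet")
    return command


def run_backup(target, volumes):
    """Run the backup; only meaningful on a Windows server where wbadmin exists."""
    return subprocess.run(build_backup_command(target, volumes), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe anywhere.
    print(" ".join(build_backup_command("E:", ["C:", "D:"])))
```

Separating the command construction from its execution makes it possible to review or log the exact command before a server runs it.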
Like its predecessors, Windows Server 2012 has several automatic
repair features. If the boot manager or a corrupted system file is
preventing startup, the Startup Repair tool is started automatically and
will initiate the repair of the server. The Startup Repair tool can be
helpful if one or more of the following problems are preventing startup:
- A virus infection in the master boot record
- A missing or corrupt boot manager
- A boot configuration data store with bad entries
- A corrupted system file
Although Startup Repair typically runs automatically, you can manually initiate this feature by completing the following steps:
- If the computer won’t start normally, you’ll see a Windows Boot
  Manager error screen stating that Windows failed to start. Press Enter.
- On the OS Selection screen, press F8.
- On the Advanced Boot Options screen, choose an appropriate safe mode
  or other alternate mode to try to start the server so that you can log
  in to diagnose and resolve the problem.
You also can use the installation disc to initiate recovery. To do so, follow these steps:
- Insert the Windows installation disc, and then boot from the
  installation disc by pressing a key when prompted during startup. If the
  server does not allow you to boot from the installation disc, you might
  need to change firmware options to allow booting from a CD/DVD-ROM
  drive.
- Windows Setup should start automatically. On the Install Windows
  page, select the language, time, and keyboard layout options that you
  want to use. Tap or click Next.
- When prompted, do not tap or click Install Now. Instead, tap or click
  the Repair Your Computer link in the lower left corner of the Install
  Windows page.
- On the Recovery screen, tap or click Troubleshoot. Then, on the
  Advanced Options screen, tap or click Command Prompt to access the
  MINWINPC environment.
- Change directories to x:\sources\recovery by typing cd recovery.
- Run the Startup Repair Wizard by typing startrep.
You can recover a server’s operating system or perform a full system recovery by using a Windows installation
disc and a backup that you created earlier with Windows Server Backup.
To initiate a recovery, on the Recovery screen, tap or click
Troubleshoot. Then, on the Advanced Options screen, tap or click System
Image Recovery.
With an operating system recovery, you recover all critical volumes
but do not recover nonsystem volumes. If you recover a full system,
Windows Server Backup reformats and repartitions all disks that are
attached to the server. Because of this, you should use this method only
when you want to recover the server data onto separate hardware or when
all other attempts to recover the server on the existing hardware have
failed.
Setting startup and recovery options
As part of planning for the worst-case scenarios, you need to consider how you want systems to start up and recover if a stop
error is encountered. The options you choose can add to the boot time,
or they can prevent a system from rebooting after it encounters a stop
error.
You can configure startup and recovery options by completing the following steps:
- In the Control Panel, tap or click System And Security\System to start the System utility.
- Tap or click Advanced System Settings. This opens the System Properties dialog box.
- On the Advanced tab, tap or click Settings in the Startup And Recovery panel. This displays the dialog box shown in Figure 1.
- In the Startup And Recovery dialog box, you configure the settings as follows:
  - If a server has multiple operating systems, you can set the default
    operating system by selecting one of the operating systems in the
    Default Operating System list. These options are obtained from the
    boot manager.
  - When multiple operating systems are installed, the Time To Display
    List Of Operating Systems option controls how long the system waits
    before booting to the default operating system. In most cases, you
    won’t need more than a few seconds to make a choice, so reduce this
    wait time to perhaps 5 or 10 seconds. Alternatively, you can have the
    system automatically choose the default operating system by clearing
    this option.
  - When you want to display recovery options, the operating system uses
    the Time To Display Recovery Options When Needed setting to determine
    how long to wait for you to choose a recovery option. The default wait
    time is 30 seconds. If you don’t choose a recovery option in that
    time, the system boots normally without recovery. As with operating
    systems, you won’t need more than a few seconds to make a choice, so
    reduce this wait time to perhaps 5 or 10 seconds.
  - Under System Failure, you have several important options for
    determining what happens when a system experiences a stop error. By
    default, the Write An Event To The System Log check box is selected so
    that the system logs an error in the system log. The check box appears
    dimmed, so it cannot normally be changed. The Automatically Restart
    check box is selected to ensure that the system attempts to reboot
    when a stop error occurs.
Important
In some cases, you might want the system to halt rather than reboot.
For example, if you are having problems with a server, you might want it
to halt so that an administrator will be more likely to notice that it
is experiencing problems. Don’t, however, prevent automatic reboot without a specific reason.
  - The Write Debugging Information options allow you to choose the type
    of debugging information that should be created when a stop error
    occurs. In most cases, you will want debug information to be dumped so
    that you can use it to determine the cause of a crash.
Important
If you choose a complete memory dump, you dump all physical memory
being used at the time of the failure. (A kernel memory dump, by
contrast, records only the memory in use by the kernel.) You can create
the dump file only if the system is properly configured. The system
drive must have a paging file at least as large as RAM and adequate
disk space to write the dump file.
  - By default, dump files are written to the %SystemRoot% folder. If
    you want to write the dump file to a different location, type the file
    path in the Dump File box. Select the Overwrite Any Existing File
    option to ensure that only one dump file is maintained.
- Tap or click OK twice to close all open dialog boxes.
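The two prerequisites the note on dump files gives for dumping all physical memory (a paging file on the system drive at least as large as RAM, and adequate disk space for the dump file) lend themselves to a quick pre-check. The following Python sketch is illustrative only: the function name is hypothetical, the sizes are passed in by the caller rather than read from the system, and "adequate disk space" is assumed here to mean room for a file as large as RAM.

```python
GIB = 1024 ** 3


def dump_prerequisites(ram_bytes, paging_file_bytes, free_disk_bytes):
    """Check the two conditions the text gives for writing a full memory dump.

    Returns a list of unmet conditions; an empty list means both are satisfied.
    """
    problems = []
    if paging_file_bytes < ram_bytes:
        problems.append("paging file on the system drive is smaller than RAM")
    # Assumption: "adequate disk space" is taken to mean room for a dump
    # file as large as installed RAM.
    if free_disk_bytes < ram_bytes:
        problems.append("not enough free disk space for the dump file")
    return problems


if __name__ == "__main__":
    # Example: 16 GiB of RAM, an 8 GiB paging file, and 4 GiB of free disk.
    for problem in dump_prerequisites(16 * GIB, 8 * GIB, 4 * GIB):
        print(problem)
```

Running a check like this when a server is commissioned is cheaper than discovering after a crash that no dump file could be written.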